Integrating both Visual and Audio Cues for Enhanced Video Caption

Authors

  • Wangli Hao
  • Zhaoxiang Zhang
  • He Guan
  • Guibo Zhu
Abstract

Video caption refers to automatically generating a descriptive sentence for a short video clip, a task that has achieved remarkable success recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefit of visual-audio resonance information. The first explores the impact of cross-modality feature fusion from low to high order. The second establishes a short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends the temporal dependency to the long term by sharing a multimodal memory across the visual and audio modalities. Extensive experiments validate the effectiveness of our three cross-modality fusion strategies on two benchmark datasets, Microsoft Research Video to Text (MSR-VTT) and Microsoft Video Description (MSVD). Notably, weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework to handle the case of partially missing modalities. Experimental results demonstrate that, even when the audio modality is absent, we can still obtain comparable results with the aid of an additional audio modality inference module.

Introduction

Automatically describing video with natural sentences has potential applications in many fields, such as human-robot interaction and video retrieval. Recently, benefiting from the extraordinary abilities of convolutional neural networks (CNNs) (Simonyan and Zisserman 2014; Szegedy et al. 2015; Szegedy et al. 2016), recurrent neural networks (RNNs) (Hochreiter and Schmidhuber 1997), and large paired video-language description datasets (Xu et al. 2016), video caption has achieved promising successes.

Most video caption frameworks can be split into an encoder stage and a decoder stage. Conditioned on a fixed-length visual feature representation produced by the encoder, the decoder generates the corresponding video description recurrently. Several methods have been proposed to obtain a fixed-length video representation, such as pooling over frames (Venugopalan et al. 2014), holistic video representations (Gua ; Rohrbach et al. 2015; Rohrbach et al. 2013), sub-sampling a fixed number of input frames (Yao et al. 2015), and extracting the last hidden state of a recurrent visual feature encoder (Venugopalan et al. 2015).

The feature encoding methods mentioned above are based only on visual cues. However, videos contain both a visual modality and an audio modality, and the resonance information underlying them is essential for video caption generation. We believe that the lack of either modality results in a loss of information. For example, when a person is lying on the bed and singing a song, traditional video caption methods may generate an incomplete sentence such as "a person is lying on the bed", due to the loss of the resonance information underlying the audio modality. If audio features can be integrated into the video caption framework, the more precise sentence "a person is lying on the bed and singing" can be expected.
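To ground the discussion, the following is a minimal sketch of the generic encoder-decoder captioning pipeline described above: frame-level CNN features are mean-pooled into a fixed-length video representation that conditions an LSTM decoder emitting the caption word by word. The sketch is written in PyTorch, and all module names and dimensions are illustrative assumptions rather than the paper's actual configuration.

import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    """Hypothetical baseline: mean-pooled frame features condition an LSTM decoder."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)     # project the pooled video feature
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) per-frame CNN features
        # captions:    (batch, seq_len) word indices of the target sentence
        video = self.encode(frame_feats.mean(dim=1))      # fixed-length video encoding
        h0 = video.unsqueeze(0)                           # use it as the decoder's initial state
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.classifier(out)                       # (batch, seq_len, vocab_size) word scores

Mean pooling is only one of the encoding choices listed above; sub-sampling frames or taking the last hidden state of a recurrent encoder would slot into the same interface.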
To thoroughly utilize both visual and audio information, we propose and analyze three multimodal deep fusion strategies that maximize the benefit of visual-audio resonance information. The first explores the impact of cross-modality feature fusion from low to high order. The second establishes a short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends the temporal dependency to the long term by sharing a multimodal memory across the visual and audio modalities. Furthermore, a dynamic multimodal feature fusion framework is proposed to deal with the case where the audio modality is absent during caption generation.

The contributions of our paper are:

a. We present three multimodal feature fusion strategies to efficiently integrate audio cues into video caption.

b. We propose an audio modality inference module to handle the missing-audio problem by generating an audio feature from the corresponding visual feature of the video.

c. Our experimental results on the Microsoft Research Video to Text (MSR-VTT) and Microsoft Video Description (MSVD) datasets show that our multimodal feature fusion frameworks lead to improved video caption results.

Related Works

Video Caption

Early works on video caption can be classified into three groups. The first category consists of template-based methods: they first identified the semantic attributes hidden in videos and then derived a sentence structure from predefined sentence templates (Krishnamoorthy et al. 2013; Thomason et al. 2014), after which a probabilistic graphical model was used to collect the most relevant contents in the video and generate the corresponding sentence. Although the sentences generated by these models seemed grammatically correct, they lacked richness and flexibility. The second category treats video caption as a retrieval problem: videos are tagged with metadata (Aradhye, Toderici, and Yagnik 2009), and videos and captions are then clustered based on these tags. Although the generated sentences were more natural than those of the first group, they depended heavily on the metadata. The third category directly maps visual representations to the provided sentences (Venugopalan et al. 2014; Yao et al. 2015; Pan et al. 2016a; Pan et al. 2016b), taking inspiration from image caption (Vinyals et al. 2015; Donahue et al. 2015). We argue that these video caption methods rely only on visual information while ignoring audio cues, which restricts their performance. To handle this problem, we explore incorporating audio information into video caption.

Exploiting Audio Information from Videos

The audio sequences underlying videos always carry meaningful information, and many researchers have recently tried to incorporate audio information into their specific applications. In (Owens et al. 2016), Owens et al. adopted ambient sounds as a supervisory signal for training visual models; their experiments showed that the units of a network trained under sound supervision carried semantically meaningful information about objects and scenes. Ren et al. (Ren et al. 2016) proposed a multimodal Long Short-Term Memory (LSTM) network for speaker identification, i.e., locating the person whose identity matches the ongoing sound in a given video. Their key point was sharing weights across face and voice to model the temporal dependency over these two different modalities.
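As a concrete reference for this weight-sharing pattern, which our second fusion strategy adopts for the visual and audio streams, the minimal sketch below (in PyTorch, with dimensions and the final concatenation of last hidden states chosen purely for illustration, not taken from the paper) runs both modality sequences through a single LSTM with shared parameters after projecting each modality into a common input space.

import torch
import torch.nn as nn

class SharedWeightEncoder(nn.Module):
    """Hypothetical front end: one LSTM, shared by the visual and audio streams."""
    def __init__(self, v_dim=2048, a_dim=128, common_dim=512, hidden_dim=512):
        super().__init__()
        self.v_in = nn.Linear(v_dim, common_dim)  # modality-specific input projections
        self.a_in = nn.Linear(a_dim, common_dim)
        self.shared_lstm = nn.LSTM(common_dim, hidden_dim, batch_first=True)

    def forward(self, v_seq, a_seq):
        # v_seq: (batch, T, v_dim) frame features; a_seq: (batch, T, a_dim) audio features
        _, (h_v, _) = self.shared_lstm(self.v_in(v_seq))  # the same recurrent weights...
        _, (h_a, _) = self.shared_lstm(self.a_in(a_seq))  # ...process both modalities
        return torch.cat([h_v[-1], h_a[-1]], dim=-1)      # joint video representation

enc = SharedWeightEncoder()
video_repr = enc(torch.randn(4, 30, 2048), torch.randn(4, 30, 128))  # shape (4, 1024)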
Inspired by (Ren et al. 2016), we build a temporal dependency across the visual and audio modalities through weight sharing for video caption, aiming to explore whether such a cross-modality temporal dependency can capture the resonance information between the two modalities.

Memory Extended Recurrent Neural Networks

The internal memory of an RNN can preserve valuable information for specific tasks, but it cannot handle tasks that require a long-term temporal dependency well. To enhance the memory ability of RNNs, several works extend them with an external memory, such as the Neural Turing Machine (NTM) (Graves, Wayne, and Danihelka 2014) and memory networks (Weston, Chopra, and Bordes 2014); such models are simply dubbed memory-enhanced RNNs (ME-RNNs). ME-RNNs have been widely applied. Besides single tasks that need a long temporal dependency, such as visual question answering (Xiong, Merity, and Socher 2016) and dialog systems (Dodge et al. 2015), ME-RNNs have also been adopted in multi-task settings to model a long temporal dependency across different tasks (Liu, Qiu, and Huang 2016). To explore whether a long visual-audio temporal dependency can capture the resonance information between the two modalities, we make a first attempt to build a memory shared across the visual and audio modalities for video caption.

Methods

In this section, we first introduce the basic video caption framework that our work is based on. Then, the three multimodal feature fusion strategies for video caption are described. Finally, the dynamic multimodal feature fusion framework and its core component, the audio modality inference module (AMIN), are presented.

[Figure: model architecture composed of LSTM units; the diagram itself is not reproduced here.]
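As a rough illustration of what "fusion order" means in the first strategy, the sketch below contrasts a low-order fusion (concatenation followed by a linear projection) with a higher-order one (a factorized bilinear interaction realized as an elementwise product of projected features). The specific operators, activations, and sizes are assumptions made for illustration and are not claimed to match the paper's exact formulation.

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Low-order fusion: concatenate the two features, then project."""
    def __init__(self, v_dim=2048, a_dim=128, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(v_dim + a_dim, out_dim)

    def forward(self, v, a):
        return torch.tanh(self.proj(torch.cat([v, a], dim=-1)))

class FactorizedBilinearFusion(nn.Module):
    """Higher-order fusion: an elementwise product of projected features
    approximates a bilinear (second-order) visual-audio interaction."""
    def __init__(self, v_dim=2048, a_dim=128, out_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, out_dim)
        self.a_proj = nn.Linear(a_dim, out_dim)

    def forward(self, v, a):
        return torch.tanh(self.v_proj(v) * self.a_proj(a))

# v is a visual feature vector, a the synchronized audio feature vector
v, a = torch.randn(4, 2048), torch.randn(4, 128)
low = ConcatFusion()(v, a)               # shape (4, 512)
high = FactorizedBilinearFusion()(v, a)  # shape (4, 512)

Higher-order interactions can capture finer cross-modality correlations at the cost of extra parameters and harder optimization, which is the kind of trade-off the first strategy examines empirically.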


Similar Articles

Generating Natural Video Descriptions via Multimodal Processing

Generating natural language descriptions of visual content is an intriguing task which has wide applications such as assisting blind people. The recent advances in image captioning stimulate further study of this task in more depth including generating natural descriptions for videos. Most works of video description generation focus on visual information in the video. However, audio provides ri...


Predicting Visual Features from Text for Image and Video Caption Retrieval

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict...


Retrieving of Video Scenes Using Arabic Closed-caption

The increased use of video documents for multimedia-based applications has created a demand for strong video database support, including efficient methods for browsing and retrieving video data. Most solutions to video browsing and retrieval of video data rely on visual information only, ignoring the rich source of the accompanying audio signal and texts. Speech is the significant information t...


Detection of slide transition for topic indexing

This paper presents an automatic and novel approach in detecting the transitions of slides for video sequences of technical lectures. Our approach adopts a foreground vs background segmentation algorithm to separate a presenter from the projected electronic slides. Once a background template is generated, text captions are detected and analyzed. The segmented caption regions as well as backgrou...


Comparing the Impact of Audio-Visual Input Enhancement on Collocation Learning in Traditional and Mobile Learning Contexts

This study investigated the impact of audio-visual input enhancement teaching techniques on improving English as a Foreign Language (EFL) learners' collocation learning, as well as their accuracy concerning collocation use in narrative writing. In addition, it compared the impact and efficiency of audio-visual input enhancement in two learning contexts, namely traditional and mo...



Journal:
  • CoRR

Volume: abs/1711.08097

Publication date: 2017